Web-scale Content Reuse Detection (extended), USC/ISI Technical Report ISI-TR-692, June 2014

Authors

  • Calvin Ardi
  • John Heidemann
Abstract

With the vast amount of accessible online content, it is not surprising that unscrupulous entities “borrow” from the web to provide filler for advertisements, link farms, and spam, and so make a quick profit. Our insight is that cryptographic hashing and fingerprinting can efficiently identify content reuse in web-size corpora. We develop two related algorithms: one to automatically discover previously unknown duplicate content in the web, and a second to detect copies of discovered or manually identified content in the web. Our detection can also identify bad neighborhoods, clusters of pages where copied content is frequent. We verify our approach with controlled experiments on two large datasets: a Common Crawl subset of the web, and a copy of Geocities, an older set of user-provided web content. We then demonstrate that we can discover otherwise unknown examples of duplication for spam, and detect both discovered and expert-identified content in these large datasets. Using an original copy of Wikipedia as identified content, we find 40 sites that reuse this content, 86% for commercial benefit.
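The abstract's idea of hashing content to find reuse can be illustrated with a minimal sketch. This is not the report's implementation: the paragraph-level chunking, the choice of SHA-1, the `min_pages` and `threshold` parameters, and the function names `chunk_hashes`, `discover`, and `detect` are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def chunk_hashes(text):
    """Split a page into paragraph chunks and hash each one.

    Assumption: chunks are blank-line-separated paragraphs, hashed with SHA-1.
    """
    chunks = [c.strip() for c in text.split("\n\n") if c.strip()]
    return {hashlib.sha1(c.encode("utf-8")).hexdigest() for c in chunks}


def discover(corpus, min_pages=2):
    """Discovery: return chunk hashes seen on at least `min_pages` pages."""
    seen = defaultdict(set)
    for url, text in corpus.items():
        for h in chunk_hashes(text):
            seen[h].add(url)
    return {h: urls for h, urls in seen.items() if len(urls) >= min_pages}


def detect(corpus, labeled, threshold=0.5):
    """Detection: return pages whose share of chunks matching `labeled`
    hashes is at least `threshold` (a hypothetical cutoff)."""
    hits = {}
    for url, text in corpus.items():
        hashes = chunk_hashes(text)
        if not hashes:
            continue
        fraction = len(hashes & labeled) / len(hashes)
        if fraction >= threshold:
            hits[url] = fraction
    return hits


# Hypothetical toy corpus: one original page, one copy with filler, one unrelated page.
original = "Original essay paragraph one.\n\nOriginal essay paragraph two."
corpus = {
    "a.example/essay": original,
    "b.example/spam": original + "\n\nBuy now!",
    "c.example/blog": "Something unrelated.\n\nAnother unrelated paragraph.",
}

duplicates = discover(corpus)
copies = detect(corpus, chunk_hashes(original))
```

In this toy corpus, `discover` flags the two shared paragraphs and `detect` reports the original page and the padded copy, but not the unrelated page. Exact hashing only catches verbatim chunks; any edit to a paragraph changes its hash, which is one reason a real system might choose chunk boundaries and thresholds differently.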


Similar resources

Back Out: End-to-end Inference of Common Points-of-Failure in the Internet (extended)

Internet reliability has many potential weaknesses: fiber rights-of-way at the physical layer, exchange-point congestion from DDoS at the network layer, settlement disputes between organizations at the financial layer, and government intervention at the political layer. This paper shows that we can discover common points-of-failure at any of these layers by observing correlated failures. We use en...


Census and Survey of the Visible Internet (extended), USC/ISI Technical Report ISI-TR-2008-649b

Prior measurement studies of the Internet have explored traffic and topology, but have largely ignored edge hosts. While the number of Internet hosts is very large, and many are hidden behind firewalls or in private address space, there is much to be learned from examining the population of visible hosts, those with public unicast addresses that respond to messages. In this paper we introduce t...


Census and Survey of the Visible Internet (extended), USC/ISI Technical Report ISI-TR-2008-649

Prior measurement studies of the Internet have explored traffic and topology, but have largely ignored edge hosts. While the number of Internet hosts is very large, and many are hidden behind firewalls or in private address space, there is much to be learned from examining the population of visible hosts, those with public unicast addresses that respond to messages. In this paper we introduce t...


A. Shah et al., Photovoltaic Technology: The Case for Thin-Film Solar Cells

A. Shah et al., Photovoltaic Technology: The Case for Thin-Film Solar Cells, Science 285, 692 (1999). Resources related to this article, including high-resolution figures, are available in the online version at http://www.sciencemag.org/cgi/content/full/285/5428/692 (www.sciencemag.org; this information is current as of December 16, 2006) ...


File: draft-ietf-rsvp-md5-07.txt, Bob Lindell (USC/ISI), Mohit Talwar (USC/ISI)

Cisco File: draft-ietf-rsvp-md5-07.txt Bob Lindell USC/ISI Mohit Talwar USC/ISI RSVP Cryptographic Authentication Status of this Memo This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft ...



Publication date: 2014